Please Note

This notebook is the combination of deliverable 1 and 2, with all my code & graph & analysis, Thanks!

I. Write-Up

  1. Why accidents analysis is important
    • For Transportation Planners: the analysis of accidents will help transport planners to find out the shortage of current transportation systems and propose future regulations. Some frequent accident places, especially with high severity (like highway & intersection) may need to add some signs or need to control the speed during a certain period to prevent future accidents.
    • For Drivers: the model can be added into Google Map as an accident plug-in: when people want to navigate the trip through the map, the model can calculate the accidents’ probability of the place that the route will pass based on current weather, time and geographical information. It’s a useful tool to get drivers’ attention and inform them of the risk in advance.
  1. Methodology
    • Grid-based analysis (10km*10km grid)
    • Feature
      • Location: State, Number of related facilities(e.g. railway, crossing, junction, stop, traffic signal, turning loop
      • Time: time of day, day of the week, month, year
      • Temperature, Wind Speed, Visibility, Humidity
    • Model Random Forest: Calculate the number of severe accidents (severity = 3 or 4) within each grid at a certain time as the dependent variable. While most of the grid at a certain time only has one accident, thus, I decide to view it as an accident classification problem: have an accident or no accident.

II. Exploratory Analysis

In [ ]:
pip install geopandas
!pip install plotly
pip install rtree
Requirement already satisfied: geopandas in /opt/anaconda3/envs/new/lib/python3.6/site-packages (0.7.0)
Requirement already satisfied: fiona in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from geopandas) (1.8.4)
Requirement already satisfied: shapely in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from geopandas) (1.7.0)
Requirement already satisfied: pandas>=0.23.0 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from geopandas) (1.0.3)
Requirement already satisfied: pyproj>=2.2.0 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from geopandas) (2.6.1.post1)
Requirement already satisfied: attrs>=17 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from fiona->geopandas) (19.3.0)
Requirement already satisfied: click<8,>=4.0 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from fiona->geopandas) (7.1.1)
Requirement already satisfied: cligj>=0.5 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from fiona->geopandas) (0.5.0)
Requirement already satisfied: click-plugins>=1.0 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from fiona->geopandas) (1.1.1)
Requirement already satisfied: six>=1.7 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from fiona->geopandas) (1.14.0)
Requirement already satisfied: munch in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from fiona->geopandas) (2.5.0)
Requirement already satisfied: pytz>=2017.2 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from pandas>=0.23.0->geopandas) (2020.1)
Requirement already satisfied: numpy>=1.13.3 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from pandas>=0.23.0->geopandas) (1.18.1)
Requirement already satisfied: python-dateutil>=2.6.1 in /opt/anaconda3/envs/new/lib/python3.6/site-packages (from pandas>=0.23.0->geopandas) (2.8.1)
Note: you may need to restart the kernel to use updated packages.
In [ ]:
import pandas as pd
import os
import seaborn as sns
import numpy as np 
import matplotlib.pyplot as plt 
from google.colab import drive
import geopandas as gpd
import plotly.graph_objects as go
In [ ]:
df = pd.read_csv('/Users/zhangjing/Desktop/cis545/final/US_Accidents_Dec19.csv')
pd.set_option('display.max_columns', None)
In [ ]:
# clean time related variables
import datetime

df['year'] = pd.DatetimeIndex(df['Start_Time']).year
#df = df[df['year'].isin(['2016','2017', '2018', '2019'])] # drop other years 
df['month'] = pd.DatetimeIndex(df['Start_Time']).month
df['hour'] = pd.DatetimeIndex(df['Start_Time']).hour
df['weekday'] = pd.DatetimeIndex(df['Start_Time']).dayofweek
df['week'] = pd.DatetimeIndex(df['Start_Time']).week

condition = [
             df['weekday']==0, df['weekday']==1, df['weekday']==2, 
             df['weekday']==3, df['weekday']==4, df['weekday']==5, 
             df['weekday']==6, 
]

choices = ['Mon','Tue','Wed',"Thur","Fri","Sat","Sun"]
df['weekday2'] = np.select(condition, choices, default='black')
df['ref'] = 1

1- Geography

The plot of accidents from 2016 to 2019 shows that most of the accidents happened on the highway and the city's main road. California has the highest number of accidents in the last three years.

In [ ]:
plt.figure(figsize=(50,30))
plt.title('Plot of Traffic Accident from 2016 to 2019', fontsize = 50)
plt.xlabel('Longitude',fontsize=40)
plt.ylabel('Latitude',fontsize=40)
plt.plot(df.Start_Lng, df.Start_Lat, ".", alpha=0.5)
plt.xticks(fontsize=30)
plt.yticks(fontsize=30)
plt.show()
In [ ]:
count_by_state = pd.value_counts(df['State']).reset_index()
count_by_state.columns = ['State', 'Value']

state_high = pd.value_counts(df['State']).reset_index()

fig = go.Figure(data=go.Choropleth(
    locations = count_by_state.State,
    z = count_by_state.Value,  
    locationmode = 'USA-states',   
    colorscale = 'tempo',
    colorbar_title = "Count of Accidents",
))

fig.update_layout(
    title_text = 'US Traffic Accidents by State 2016-2019',
    geo_scope='usa', # limite map scope to USA
)

fig.show()

2 - Time & Severity

This is the plot of the number of accidents in each week from 2016 to 2019. Its fluctuation indicates that the number of accidents changes periodically, e.g. there is a stark drop in the last week of each year. As for the change of severity, severity 2 type of accidents fluctuate significantly by weeks and it experienced an obvious increase from 2019 July (31st week). However, other types of accidents, especially severe accidents (severity 3 & 4) don't have dramatic quantity changes from 2016 to 2019.

Now, let's zoom into the data by month and week to find if there are potential regular patterns.

In [ ]:
fig, ax = plt.subplots(figsize=(13,6))
ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
week_year = df.groupby(['year','week','Severity']).count()['ref'].unstack()
week_year.plot(ax=ax)
ax.set_xlabel('Week', fontsize=14)
ax.set_ylabel('Count of Accidents',fontsize=14)
Out[ ]:
Text(0, 0.5, 'Count of Accidents')

Month

The left diagram shows that the time between August to December is the peak crash period over the span of a year. It seems that Severity 2 changes a lot by time especially after July, while other types of accidents are relatively flat compared with the change of Severity 2

In [ ]:
fig, (ax1,ax2) = plt.subplots(1,2,figsize=(13,4))
# month & total count 
ax1.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
time_day = pd.value_counts(df['month']).reset_index()
time_day.sort_values('index', inplace=True)
ax1.plot('index', 'month', data = time_day)
ax1.set_title('Month & Total Accident count (2016-2019)', fontsize=14) 
ax1.set_xticks(list(time_day.index))
ax1.set_xlabel('Month', fontsize=14)
ax1.set_ylabel('Count of Traffic Accident', fontsize=14) 

# month & severity
time_day = df.groupby(['month','Severity']).count()['ref'].unstack()
time_day.plot(ax=ax2)

ax2.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax2.set_xticks(list(time_day.index))
ax2.set_xlabel('Month', fontsize=14)
#ax2.set_ylabel('', fontsize=14) 
ax2.set_title('Month & Accident Severity', fontsize=14) 
plt.show()

Time of Day

The left plot shows that most of the accidents happened at weekday rush hour (e.g. 7:00-8:00 and 17:00-18:00) which is easy to understand as the traffic volume during that time is also huge. Obviously, there are fewer traffic accidents at the weekend, and most of them concentrated around 11:00 to 15:00. However, accident severity doesn't show a strong relationship with time of day (as the right diagram shows) because the whole tendency is similar to the total count.

In [ ]:
import matplotlib
fig, (ax,ax1) = plt.subplots(1,2,figsize=(13,5))

# time & total
df['ref'] = 1
time_day = df.groupby(['hour','weekday2']).count()['ref'].unstack()
time_day.plot(ax=ax)

ax.get_yaxis().set_major_formatter(matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.set_xticks(list(time_day.index))
ax.set_xlabel('Time of Day', fontsize=14)
ax.set_ylabel('Number of Traffic Accidents', fontsize=14) 
ax.set_title('Count of Traffic Accident by Time', fontsize=14) 

# time & severity
time_day = df.groupby(['hour','Severity']).count()['ref'].unstack()
time_day.plot(ax=ax1)

ax1.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax1.set_xticks(list(time_day.index))
ax1.set_xlabel('Time of Day', fontsize=14)
ax1.set_title('Time & Accident Severity', fontsize=14) 
Out[ ]:
Text(0.5, 1.0, 'Time & Accident Severity')

3 - Weather

Before we dive into the exploration of weather-related features, it's important to check the number of NA of each feature. This diagram shows that some weather-related variables are kind of useless, e.g. precipitation, wind chill, because their NA percentage is high to 30% or even 60%. Thus, replace the NA by average or 0 is not suitable due to a large number of missing. Therefore, those variables will be dropped.

In [ ]:
NA_count = df.isna().sum().reset_index()
NA_count.columns = ['index', 'Value']
NA_count['count'] = (NA_count['Value']*100)/len(df)
NA_count = NA_count[NA_count['count']>0].sort_values('count')

fig, ax = plt.subplots(figsize=(10,6))

NA_count.plot.barh('index','count', width=0.8,color='c',align='center',linewidth=1.2,ax=ax)
ax.set_xlabel('NA percentage', fontsize=14)
ax.set_ylabel('Name of Feature',fontsize=14)
ax.set_title('Percentage of NA in each feature',fontsize=16)
Out[ ]:
Text(0.5, 1.0, 'Percentage of NA in each feature')
In [ ]:
fig, ax = plt.subplots(figsize=(13,6))

weather_condition = df['Weather_Condition'].value_counts().sort_values(ascending=False).head(30)
weather_condition.plot.bar(width=0.8,color='c',align='center',linewidth=1.2)
ax.get_yaxis().set_major_formatter(
    matplotlib.ticker.FuncFormatter(lambda x, p: format(int(x), ',')))
ax.set_xlabel('Weather condition', fontsize=14)
ax.set_ylabel('Count of Traffic Accident',fontsize=14)
ax.set_title('Traffic Accident & Weather Condition',fontsize=16)
Out[ ]:
Text(0.5, 1.0, 'Traffic Accident & Weather Condition')

Let's start with the analysis of weather condition description. This bar chart calculates the number of traffic accidents in different weather conditions. It seems that most of the accidents occur in Clear,Mostly Cloudy,Overcastdays, while it doesn't mean that those weather conditions can cause accidents. A more accurate method is focusing on the percentage, aka, the percentage of severity under each weather condition. In that case, we can figure out which types of weather will result in which types of accidents.

In [ ]:
Weather_Cond = df.groupby(['Weather_Condition','Severity']).count()['ref'].reset_index()
#['Weather_Condition']
severity_1_by_weather = Weather_Cond[Weather_Cond['Severity']==1]['ref']
severity_2_by_weather = Weather_Cond[Weather_Cond['Severity']==2]['ref']
severity_3_by_weather = Weather_Cond[Weather_Cond['Severity']==3]['ref']
severity_4_by_weather = Weather_Cond[Weather_Cond['Severity']==4]['ref']
In [ ]:
weather_count = df.groupby(['Weather_Condition','Severity']).count()['ref'].reset_index()
weather_df2 = weather_count[weather_count['ref']>20]
weather_count2 = weather_df2.groupby(['Weather_Condition']).sum()[['ref']]

weather_df3 = pd.merge(weather_df2, weather_count2, how="left", on ='Weather_Condition')
weather_df3['percent']= (weather_df3['ref_x']*100)/weather_df3['ref_y']
weather_df3

weather_Severity1 = weather_df3[weather_df3['Severity']==1]
weather_Severity2 = weather_df3[weather_df3['Severity']==2].sort_values('percent',ascending=False)
weather_Severity3 = weather_df3[weather_df3['Severity']==3]
weather_Severity4 = weather_df3[weather_df3['Severity']==4]
In [ ]:
plt.figure(figsize=(25, 10))
plt.bar(weather_Severity2.Weather_Condition, weather_Severity2.percent, label='Severity 2')
plt.bar(weather_Severity3.Weather_Condition, weather_Severity3.percent, label='Severity 3')
plt.bar(weather_Severity4.Weather_Condition, weather_Severity4.percent, label='Severity 4')
plt.bar(weather_Severity1.Weather_Condition, weather_Severity1.percent, label='Severity 1')

plt.xticks(rotation=90,fontsize=15)
plt.yticks(fontsize=14)
plt.xlabel("Weather Condition (count>20)", fontsize=18)
plt.ylabel("Severity Percentage", fontsize=18)
plt.title("Severity & Weather Condition", fontsize=25)
plt.legend(fontsize =20)
Out[ ]:
<matplotlib.legend.Legend at 0x1ad0ecc048>

As mentioned before, this stacked bar chart indicates the percentage of accident severity by weather conditions (only select the weather condition when more than 20 cases). For the whole dataset, 67% of the accident is Severity 2, so it's reasonable that for most of the weather conditions, Severity 2 is the majority as blue bar shows. While, notably, the share of severe accidents (Severity 3(orange bar) & Severity 4(green bar)) are significantly high during extreme weather, for example, Blowing Snow, Light Ice Pellets, Light Freezing Rain, Heavy Snow, T-Storm/Windy. As extreme weather is rare, the accidents that occurred during these weathers are also limited compared with the accidents occurred during normal weather. While once there is an accident occurred in extreme weather, it's highly likely to be a severe case.

In [ ]:
#df['precipitation'] = df['Precipitation(in)'].fillna(0, inplace=False)
df = df.dropna(subset=['City','Temperature(F)','Visibility(mi)','Pressure(in)','Humidity(%)',
                                    'Weather_Condition']) 
In [ ]:
df['tem_qcut']=pd.qcut(df['Temperature(F)'], 15)
df['humidity_qcut']=pd.qcut(df['Humidity(%)'], 15)
df2 = df.dropna(subset=['Wind_Speed(mph)']) 
df2['wind_speed_qcut']= pd.qcut(df['Wind_Speed(mph)'], 15,duplicates='drop')
df['visi_qcut']=pd.cut(df['Visibility(mi)'], 15,duplicates='drop')


tem_qcut = df.groupby(['tem_qcut']).mean()['Severity'].reset_index()
humidity_qcut = df.groupby(['humidity_qcut']).mean()['Severity'].reset_index()
visi_qcut = df.groupby(['visi_qcut']).mean()['Severity'].reset_index()
wind_speed_qcut = df2.groupby(['wind_speed_qcut']).mean()['Severity'].reset_index()

fig, (ax,ax1,ax2,ax3) = plt.subplots(4,figsize=(10,14))
fig.subplots_adjust(hspace=0.3)
tem_qcut.plot('tem_qcut','Severity',ax=ax)
#ax.set_title("Mean Severity & Temperature(F)", fontsize=14)
ax.set_xlabel("Temperature catogory (quantile)(F)", fontsize=14)
ax.set_ylabel("Mean Severity", fontsize=14)
ax.set_xticklabels(tem_qcut['tem_qcut'])

humidity_qcut.plot('humidity_qcut','Severity',ax=ax1)
#ax1.set_title("Mean Severity & Humidity(%)", fontsize=14)
ax1.set_xlabel("Humidity catogory(%) (quantile)", fontsize=14)
ax1.set_ylabel("Mean Severity", fontsize=14)
ax1.set_xticklabels(humidity_qcut['humidity_qcut'])

wind_speed_qcut.plot('wind_speed_qcut','Severity',ax=ax2)
ax2.set_xlabel("Wind Speed catogory (mph)(quantile)", fontsize=14)
ax2.set_ylabel("Mean Severity", fontsize=14)
ax2.set_xticklabels(wind_speed_qcut['wind_speed_qcut'])


visi_qcut.plot('visi_qcut','Severity',ax=ax3)
ax3.set_xlabel("Visibility catogory(mi)", fontsize=14)
ax3.set_ylabel("Mean Severity", fontsize=14)
ax3.set_xticklabels(visi_qcut['visi_qcut'],rotation=90)
plt.show()
/opt/anaconda3/envs/new/lib/python3.6/site-packages/ipykernel_launcher.py:4: SettingWithCopyWarning:


A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html#returning-a-view-versus-a-copy

After dropping NA for some useful weather-related features, I explored the relationship between accident severity and Temperature / Humidity / Wind Speed as this series of diagrams plotted. I firstly cut the variables (Temperature / Humidity / Wind Speed) by 10-15 quantile and then group them and calculate the mean severity.

It shows that, as Temperature increases, the severity will decrease. As for Humidity, it can be viewed as a substitution of precipitation (since there are too many NA in precipitation variable), and it shows that higher Humidity associated with severer accidents. The plot didn't tell the strong or obvious relationship between wind speed and severity, while that makes sense to some extent as the wind speed doesn't affect the vehicle's capability or driver's sight. With regard to the Visibility, the relationship is also vague.

III. Grid based

In [ ]:
import shapely.speedups
shapely.speedups.enable()
In [ ]:
import geopandas as gpd

gdf = gpd.GeoDataFrame(
    df, geometry=gpd.points_from_xy(df.Start_Lng, df.Start_Lat))

#grid
path ='/Users/zhangjing/Desktop/cis545/final/grid545_2.geojson'
grid = gpd.read_file(path)

# usa
path ='/Users/zhangjing/Desktop/cis545/final/545usa2.geojson'
usa2 = gpd.read_file(path)

path ='/Users/zhangjing/Desktop/cis545/final/545philly.geojson'
philly = gpd.read_file(path)

gdf.crs = {'init' :'epsg:4326'}
grid.crs = {'init' :'epsg:4326'}
philly.crs = {'init' :'epsg:4326'}
grid1 = grid
grid = grid.reset_index()
grid.rename(columns = {'index':'Grid_ID'}, inplace = True)
#grid

1 - Spatial vidualization & explain

As mentioned before, this project cut Amerian into many 10*10km grids, and then count the number of accidents at a certain time within each grid. This first plot is the grid of the USA and all accidents. The second plot is the zoom-in version of intersected grids (only keep the grids that had accidents).

In [ ]:
fig, ax = plt.subplots(figsize=(50,30))
#df.plot(ax =ax, markersize=3, color="blue")
plt.plot(df.Start_Lng, df.Start_Lat, ".", alpha=0.5)
grid.plot(ax =ax, facecolor="none", edgecolor='black',lw=0.1)
plt.title('All 10*10km grid & all accidents', fontsize = 50)
Out[ ]:
Text(0.5, 1.0, 'All 10*10km grid & all accidents')
In [ ]:
select_grid = all_model['Grid_ID'].drop_duplicates(keep='first').reset_index()
select_grid['ref1'] = 1
select_grid["Grid_ID"] = select_grid["Grid_ID"].astype(int)
grid_select_sf = pd.merge(grid,select_grid, how='right', on=['Grid_ID'])
grid_select_sf = grid_select_sf[grid_select_sf.geometry.type == 'Polygon']
grid_select_sf

fig, ax = plt.subplots(figsize=(30,20))
#join_new.plot(ax =ax, facecolor="none", edgecolor='black',lw=0.01)
philly.plot(ax=ax,facecolor="none",edgecolor='black',lw=5)
grid_select_sf.plot(ax =ax, facecolor="none", edgecolor='black',lw=0.3)
plt.plot(df.Start_Lng, df.Start_Lat, ".",ms=0.5)

ax.set_xlim(-78, -72)  
ax.set_ylim(39, 43)  
plt.title('Selected 10*10km grid & all accidents & Philly boundary', fontsize = 30)
Out[ ]:
Text(0.5, 1.0, 'Selected 10*10km grid & all accidents & Philly boundary')

For the third plot, we zoomed into Philly to get more sense of grid and accidents. It seems that the number of accidents within Philadelphia are much less than the accidents outside Philadelphia. Also Ablington & Plymouth Meeting (northwest of Philadelphia) suffer severe traffic safety issues in the past three years. It might be attributed to the traffic regulation of the county or the issue with the data. I prefer to believe the latter reason as the boundary is over-obvious: the point within and outside Philadelphia present two totally different patterns.

In [ ]:
fig, ax = plt.subplots(figsize=(24,24))
#usa2.plot(ax=ax,facecolor="none", edgecolor='none',lw=0.7)
philly.plot(ax=ax,facecolor="none",edgecolor='black',lw=4)
plt.plot(df.Start_Lng, df.Start_Lat, ".")
grid.plot(ax=ax,facecolor="none", edgecolor='black',lw=0.5)
ax.set_xlim(-75.4, -74.9)  
ax.set_ylim(39.8, 40.2)  
plt.title('Zoom into Philadephia', fontsize = 30)
plt.show()
In [ ]:
 

2 - Spatial join: grid & geo dataframe

In [ ]:
import rtree
import shapely.speedups
shapely.speedups.enable()
In [ ]:
# spatial join the 
join = gpd.sjoin(gdf, grid, how="left")
join
Out[ ]:
ID Source TMC Severity Start_Time End_Time Start_Lat Start_Lng End_Lat End_Lng Distance(mi) Description Number Street Side City County State Zipcode Country Timezone Airport_Code Weather_Timestamp Temperature(F) Wind_Chill(F) Humidity(%) Pressure(in) Visibility(mi) Wind_Direction Wind_Speed(mph) Precipitation(in) Weather_Condition Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop Sunrise_Sunset Civil_Twilight Nautical_Twilight Astronomical_Twilight year month hour weekday week weekday2 ref tem_qcut humidity_qcut visi_qcut geometry index_right Grid_ID
0 A-1 MapQuest 201.0 3 2016-02-08 05:46:00 2016-02-08 11:00:00 39.865147 -84.058723 NaN NaN 0.010 Right lane blocked due to accident on I-70 Eas... NaN I-70 E R Dayton Montgomery OH 45424 US US/Eastern KFFO 2016-02-08 05:58:00 36.9 NaN 91.0 29.68 10.0 Calm NaN 0.02 Light Rain False False False False False False False False False False False False False Night Night Night Night 2016 2 5 0 6 Mon 1 (31.0, 39.9] (87.0, 92.0] (9.333, 18.667] POINT (-84.05872 39.86515) 138601 138601
1 A-2 MapQuest 201.0 2 2016-02-08 06:07:59 2016-02-08 06:37:59 39.928059 -82.831184 NaN NaN 0.010 Accident on Brice Rd at Tussing Rd. Expect del... 2584.0 Brice Rd L Reynoldsburg Franklin OH 43068-3402 US US/Eastern KCMH 2016-02-08 05:51:00 37.9 NaN 100.0 29.65 10.0 Calm NaN 0.00 Light Rain False False False False False False False False False False False False False Night Night Night Day 2016 2 6 0 6 Mon 1 (31.0, 39.9] (96.0, 100.0] (9.333, 18.667] POINT (-82.83118 39.92806) 139340 139340
2 A-3 MapQuest 201.0 2 2016-02-08 06:49:27 2016-02-08 07:19:27 39.063148 -84.032608 NaN NaN 0.010 Accident on OH-32 State Route 32 Westbound at ... NaN State Route 32 R Williamsburg Clermont OH 45176 US US/Eastern KI69 2016-02-08 06:56:00 36.0 33.3 100.0 29.67 10.0 SW 3.5 NaN Overcast False False False False False False False False False False False True False Night Night Day Day 2016 2 6 0 6 Mon 1 (31.0, 39.9] (96.0, 100.0] (9.333, 18.667] POINT (-84.03261 39.06315) 131372 131372
3 A-4 MapQuest 201.0 3 2016-02-08 07:23:34 2016-02-08 07:53:34 39.747753 -84.205582 NaN NaN 0.010 Accident on I-75 Southbound at Exits 52 52B US... NaN I-75 S R Dayton Montgomery OH 45417 US US/Eastern KDAY 2016-02-08 07:38:00 35.1 31.0 96.0 29.64 9.0 SW 4.6 NaN Mostly Cloudy False False False False False False False False False False False False False Night Day Day Day 2016 2 7 0 6 Mon 1 (31.0, 39.9] (92.0, 96.0] (-0.14, 9.333] POINT (-84.20558 39.74775) 137876 137876
4 A-5 MapQuest 201.0 2 2016-02-08 07:39:07 2016-02-08 08:09:07 39.627781 -84.188354 NaN NaN 0.010 Accident on McEwen Rd at OH-725 Miamisburg Cen... NaN Miamisburg Centerville Rd R Dayton Montgomery OH 45459 US US/Eastern KMGY 2016-02-08 07:53:00 36.0 33.3 89.0 29.65 6.0 SW 3.5 NaN Mostly Cloudy False False False False False False False False False False False True False Day Day Day Day 2016 2 7 0 6 Mon 1 (31.0, 39.9] (87.0, 92.0] (-0.14, 9.333] POINT (-84.18835 39.62778) 136431 136431
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2974330 A-2974354 Bing NaN 2 2019-08-23 18:03:25 2019-08-23 18:32:01 34.002480 -117.379360 33.99888 -117.37094 0.543 At Market St - Accident. NaN Pomona Fwy E R Riverside Riverside CA 92501 US US/Pacific KRAL 2019-08-23 17:53:00 86.0 86.0 40.0 28.92 10.0 W 13.0 0.00 Fair False False False False False False False False False False False False False Day Day Day Day 2019 8 18 4 34 Fri 1 (82.4, 87.8] (38.0, 45.0] (9.333, 18.667] POINT (-117.37936 34.00248) 85406 85406
2974331 A-2974355 Bing NaN 2 2019-08-23 19:11:30 2019-08-23 19:38:23 32.766960 -117.148060 32.76555 -117.15363 0.338 At Camino Del Rio/Mission Center Rd - Accident. NaN I-8 W R San Diego San Diego CA 92108 US US/Pacific KMYF 2019-08-23 18:53:00 70.0 70.0 73.0 29.39 10.0 SW 6.0 0.00 Fair False False False False False False False False False False False False False Day Day Day Day 2019 8 19 4 34 Fri 1 (69.8, 73.0] (70.0, 74.0] (9.333, 18.667] POINT (-117.14806 32.76696) 74564 74564
2974332 A-2974356 Bing NaN 2 2019-08-23 19:00:21 2019-08-23 19:28:49 33.775450 -117.847790 33.77740 -117.85727 0.561 At Glassell St/Grand Ave - Accident. in the ri... NaN Garden Grove Fwy R Orange Orange CA 92866 US US/Pacific KSNA 2019-08-23 18:53:00 73.0 73.0 64.0 29.74 10.0 SSW 10.0 0.00 Partly Cloudy False False False False True False False False False False False False False Day Day Day Day 2019 8 19 4 34 Fri 1 (69.8, 73.0] (60.0, 65.0] (9.333, 18.667] POINT (-117.84779 33.77545) 83231 83231
2974333 A-2974357 Bing NaN 2 2019-08-23 19:00:21 2019-08-23 19:29:42 33.992460 -118.403020 33.98311 -118.39565 0.772 At CA-90/Marina Fwy/Jefferson Blvd - Accident. NaN San Diego Fwy S R Culver City Los Angeles CA 90230 US US/Pacific KSMO 2019-08-23 18:51:00 71.0 71.0 81.0 29.62 10.0 SW 8.0 0.00 Fair False False False False False False False False False False False False False Day Day Day Day 2019 8 19 4 34 Fri 1 (69.8, 73.0] (79.0, 83.0] (9.333, 18.667] POINT (-118.40302 33.99246) 85393 85393
2974334 A-2974358 Bing NaN 2 2019-08-23 18:52:06 2019-08-23 19:21:31 34.133930 -117.230920 34.13736 -117.23934 0.537 At Highland Ave/Arden Ave - Accident. NaN CA-210 W R Highland San Bernardino CA 92346 US US/Pacific KSBD 2019-08-23 20:50:00 79.0 79.0 47.0 28.63 7.0 SW 7.0 0.00 Fair False False False False False False False False False False False False False Day Day Day Day 2019 8 18 4 34 Fri 1 (75.2, 79.0] (45.0, 50.0] (-0.14, 9.333] POINT (-117.23092 34.13393) 86854 86854

2974366 rows × 62 columns

In [ ]:
all_join2 = join.reset_index()

# add new ID
all_join2.rename(columns = {'index':'New_ID'}, inplace = True)

# round time to hour
all_join2['Time_round'] = pd.to_datetime(all_join2['Start_Time']).dt.floor('60T')

# drop useless featue
all_model = all_join2.drop(['ID', 'Start_Lat','End_Lat','End_Lng','Start_Lng','Start_Time',
                            'End_Time','geometry','Source','TMC','Description','Number','Zipcode','Wind_Chill(F)',
                           'Country','Timezone','Airport_Code','Weather_Timestamp','Civil_Twilight','Astronomical_Twilight',
                           'index_right'], axis=1)
# dummy severity
all_model['Severity_1'] = np.where(all_model['Severity']==1, 1, 0)
all_model['Severity_2'] = np.where(all_model['Severity']==2, 1, 0)
all_model['Severity_3'] = np.where(all_model['Severity']==3, 1, 0)
all_model['Severity_4'] = np.where(all_model['Severity']==4, 1, 0)

all_model['Severity_3_4'] =all_model['Severity_3'] +all_model['Severity_4']

all_model['reference'] = all_model.Grid_ID.astype(str) + all_model.Time_round.astype(str)+all_model.Severity_3_4.astype(str)

3 - Tackle NA

In [ ]:
# tackle NA
all_model['precipitation'] = all_model['Precipitation(in)'].fillna(0, inplace=False)
all_model = all_model.dropna(subset=['City','Temperature(F)','Wind_Speed(mph)','Pressure(in)','Humidity(%)',
                                    'Weather_Condition','Visibility(mi)','Wind_Direction']) 
all_model = all_model.drop(['Wind_Direction', 'Precipitation(in)','tem_qcut','humidity_qcut',
                           'visi_qcut'], axis=1)
In [ ]:
NA_count = all_model.isna().sum().reset_index()
NA_count.columns = ['index', 'Value']
NA_count['count'] = (NA_count['Value']*100)/len(df)
NA_count 
Out[ ]:
index Value count
0 New_ID 0 0.000000
1 Severity 0 0.000000
2 Distance(mi) 0 0.000000
3 Street 0 0.000000
4 Side 0 0.000000
5 City 0 0.000000
6 County 0 0.000000
7 State 0 0.000000
8 Temperature(F) 0 0.000000
9 Humidity(%) 0 0.000000
10 Pressure(in) 0 0.000000
11 Visibility(mi) 0 0.000000
12 Wind_Speed(mph) 0 0.000000
13 Weather_Condition 0 0.000000
14 Amenity 0 0.000000
15 Bump 0 0.000000
16 Crossing 0 0.000000
17 Give_Way 0 0.000000
18 Junction 0 0.000000
19 No_Exit 0 0.000000
20 Railway 0 0.000000
21 Roundabout 0 0.000000
22 Station 0 0.000000
23 Stop 0 0.000000
24 Traffic_Calming 0 0.000000
25 Traffic_Signal 0 0.000000
26 Turning_Loop 0 0.000000
27 Sunrise_Sunset 3 0.000101
28 Nautical_Twilight 3 0.000101
29 year 0 0.000000
30 month 0 0.000000
31 hour 0 0.000000
32 weekday 0 0.000000
33 week 0 0.000000
34 weekday2 0 0.000000
35 ref 0 0.000000
36 Grid_ID 0 0.000000
37 Time_round 0 0.000000
38 Severity_1 0 0.000000
39 Severity_2 0 0.000000
40 Severity_3 0 0.000000
41 Severity_4 0 0.000000
42 precipitation 0 0.000000

4 - Group by & Count accidents

In [ ]:
# caluclate mean 
pre = all_model.groupby(['reference']).mean()[['precipitation','Temperature(F)',
                                                                    'Visibility(mi)','Humidity(%)',
                                                                    'Distance(mi)']].reset_index()

pre.columns = ['reference', 'mean_precipitation','mean_Temp','mean_Visibility','mean_Humidity','mean_Distance']
#pre['reference'] = pre.Grid_ID.astype(str) + pre.Time_round.astype(str)+pre.Severity.astype(str)
In [ ]:
# count severe accidents
count1 = all_model.groupby(['reference','Grid_ID','Severity_3_4','year','month','week','weekday','hour']).count()['ref'].reset_index()
#count1.sort_values(by=['ref'],ascending=False)
#pd.value_counts(count1['ref']).reset_index()
In [ ]:
# count prequency of POI within this 
all_model.loc[:, 'Amenity':'Turning_Loop'] *=1
#select_grid = all_model['Grid_ID'].drop_duplicates(keep='first').reset_index()
count2 = all_model.groupby(['Grid_ID','Severity_3_4'
                           ]).count().loc[:, 'Amenity':'Turning_Loop'].reset_index()
In [ ]:
# select unique grids and its city & county & State
select_grid = all_model[['Grid_ID','State','City',]].drop_duplicates(subset='Grid_ID',keep='first').reset_index()

5 - Merge all together

In [ ]:
all_merge = pd.merge (count1,pre, how='right',on='reference')
all_merge = pd.merge (all_merge, select_grid[['Grid_ID','State','City']], how='left',on='Grid_ID')
all_merge = pd.merge (all_merge, count2, how='left',on=['Grid_ID','Severity_3_4'])
all_merge
Out[ ]:
reference Grid_ID Severity_3_4 year month week weekday hour ref mean_precipitation mean_Temp mean_Visibility mean_Humidity mean_Distance State City Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop
0 1000072017-06-17 04:00:001 100007 1 2017 6 24 5 4 2 0.00 60.1 10.0 14.0 0.6305 NM Santa Fe 7 7 7 7 7 7 7 7 7 7 7 7 7
1 1000072018-08-05 14:00:000 100007 0 2018 8 31 6 14 1 0.00 88.0 10.0 24.0 0.0000 NM Santa Fe 4 4 4 4 4 4 4 4 4 4 4 4 4
2 1000072018-11-04 11:00:001 100007 1 2018 11 44 6 11 1 0.00 46.9 10.0 48.0 0.0000 NM Santa Fe 7 7 7 7 7 7 7 7 7 7 7 7 7
3 1000072018-11-29 13:00:001 100007 1 2018 11 48 3 13 1 0.00 53.1 10.0 23.0 0.0000 NM Santa Fe 7 7 7 7 7 7 7 7 7 7 7 7 7
4 1000072018-12-26 14:00:001 100007 1 2018 12 52 2 14 1 0.09 34.0 9.0 89.0 0.0000 NM Santa Fe 7 7 7 7 7 7 7 7 7 7 7 7 7
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2200725 999322019-05-08 09:00:000 99932 0 2019 5 19 2 9 1 0.00 45.0 10.0 82.0 0.0000 AZ Williams 2 2 2 2 2 2 2 2 2 2 2 2 2
2200726 999322019-05-22 13:00:000 99932 0 2019 5 21 2 13 1 0.00 41.0 10.0 67.0 0.0000 AZ Williams 2 2 2 2 2 2 2 2 2 2 2 2 2
2200727 999392017-08-16 10:00:000 99939 0 2017 8 33 2 10 1 0.00 68.0 10.0 24.0 0.0000 AZ Flagstaff 1 1 1 1 1 1 1 1 1 1 1 1 1
2200728 999742018-10-19 07:00:000 99974 0 2018 10 42 4 7 1 0.00 32.0 10.0 100.0 0.0000 NM Gallup 1 1 1 1 1 1 1 1 1 1 1 1 1
2200729 999972018-02-06 08:00:000 99997 0 2018 2 6 1 8 1 0.00 51.8 10.0 26.0 0.0000 NM San Ysidro 1 1 1 1 1 1 1 1 1 1 1 1 1

2200730 rows × 29 columns

6 - Convert categorical into one hot vectors & scale

In [ ]:
d_weekday = pd.get_dummies(all_merge2.weekday, prefix ='weekday')
d_year = pd.get_dummies(all_merge2.year, prefix ='year')
#d_month = pd.get_dummies(all_merge2.month, prefix ='month')
d_week = pd.get_dummies(all_merge2.week, prefix ='week')
d_hour = pd.get_dummies(all_merge2.hour, prefix ='hour')
d_state = pd.get_dummies(all_merge2.State, prefix ='state')
In [ ]:
#numeric = all_merge.drop(['reference','ref','State','weekday','year',
#                         'month','week','hour','city',], axis=1)
all_c_n = pd.concat([all_merge, d_weekday, d_year,
                     d_week, d_hour, d_state],axis = 1).drop(['reference','ref','State','weekday','year',
                          'month','week','hour','City','State'], axis=1)
all_c_n
Out[ ]:
Grid_ID Severity_3_4 mean_precipitation mean_Temp mean_Visibility mean_Humidity mean_Distance Amenity Bump Crossing Give_Way Junction No_Exit Railway Roundabout Station Stop Traffic_Calming Traffic_Signal Turning_Loop weekday_0 weekday_1 weekday_2 weekday_3 weekday_4 weekday_5 weekday_6 year_2016 year_2017 year_2018 year_2019 week_1 week_2 week_3 week_4 week_5 week_6 week_7 week_8 week_9 week_10 week_11 week_12 week_13 week_14 week_15 week_16 week_17 week_18 week_19 week_20 week_21 week_22 week_23 week_24 week_25 week_26 week_27 week_28 week_29 week_30 week_31 week_32 week_33 week_34 week_35 week_36 week_37 week_38 week_39 week_40 week_41 week_42 week_43 week_44 week_45 week_46 week_47 week_48 week_49 week_50 week_51 week_52 hour_0 hour_1 hour_2 hour_3 hour_4 hour_5 hour_6 hour_7 hour_8 hour_9 hour_10 hour_11 hour_12 hour_13 hour_14 hour_15 hour_16 hour_17 hour_18 hour_19 hour_20 hour_21 hour_22 hour_23 state_AL state_AR state_AZ state_CA state_CO state_CT state_DC state_DE state_FL state_GA state_IA state_ID state_IL state_IN state_KS state_KY state_LA state_MA state_MD state_ME state_MI state_MN state_MO state_MS state_MT state_NC state_ND state_NE state_NH state_NJ state_NM state_NV state_NY state_OH state_OK state_OR state_PA state_RI state_SC state_SD state_TN state_TX state_UT state_VA state_VT state_WA state_WI state_WV state_WY
0 100007 1 0.00 60.1 10.0 14.0 0.6305 7 7 7 7 7 7 7 7 7 7 7 7 7 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 100007 0 0.00 88.0 10.0 24.0 0.0000 4 4 4 4 4 4 4 4 4 4 4 4 4 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2 100007 1 0.00 46.9 10.0 48.0 0.0000 7 7 7 7 7 7 7 7 7 7 7 7 7 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
3 100007 1 0.00 53.1 10.0 23.0 0.0000 7 7 7 7 7 7 7 7 7 7 7 7 7 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
4 100007 1 0.09 34.0 9.0 89.0 0.0000 7 7 7 7 7 7 7 7 7 7 7 7 7 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
2200725 99932 0 0.00 45.0 10.0 82.0 0.0000 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2200726 99932 0 0.00 41.0 10.0 67.0 0.0000 2 2 2 2 2 2 2 2 2 2 2 2 2 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2200727 99939 0 0.00 68.0 10.0 24.0 0.0000 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2200728 99974 0 0.00 32.0 10.0 100.0 0.0000 1 1 1 1 1 1 1 1 1 1 1 1 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
2200729 99997 0 0.00 51.8 10.0 26.0 0.0000 1 1 1 1 1 1 1 1 1 1 1 1 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

2200730 rows × 156 columns

In [ ]:
pd.set_option('display.max_row', None)
all_c_n.dtypes
Out[ ]:
Grid_ID                 int64
Severity_3_4            int64
mean_precipitation    float64
mean_Temp             float64
mean_Visibility       float64
mean_Humidity         float64
mean_Distance         float64
Amenity                 int64
Bump                    int64
Crossing                int64
Give_Way                int64
Junction                int64
No_Exit                 int64
Railway                 int64
Roundabout              int64
Station                 int64
Stop                    int64
Traffic_Calming         int64
Traffic_Signal          int64
Turning_Loop            int64
weekday_0               uint8
weekday_1               uint8
weekday_2               uint8
weekday_3               uint8
weekday_4               uint8
weekday_5               uint8
weekday_6               uint8
year_2016               uint8
year_2017               uint8
year_2018               uint8
year_2019               uint8
week_1                  uint8
week_2                  uint8
week_3                  uint8
week_4                  uint8
week_5                  uint8
week_6                  uint8
week_7                  uint8
week_8                  uint8
week_9                  uint8
week_10                 uint8
week_11                 uint8
week_12                 uint8
week_13                 uint8
week_14                 uint8
week_15                 uint8
week_16                 uint8
week_17                 uint8
week_18                 uint8
week_19                 uint8
week_20                 uint8
week_21                 uint8
week_22                 uint8
week_23                 uint8
week_24                 uint8
week_25                 uint8
week_26                 uint8
week_27                 uint8
week_28                 uint8
week_29                 uint8
week_30                 uint8
week_31                 uint8
week_32                 uint8
week_33                 uint8
week_34                 uint8
week_35                 uint8
week_36                 uint8
week_37                 uint8
week_38                 uint8
week_39                 uint8
week_40                 uint8
week_41                 uint8
week_42                 uint8
week_43                 uint8
week_44                 uint8
week_45                 uint8
week_46                 uint8
week_47                 uint8
week_48                 uint8
week_49                 uint8
week_50                 uint8
week_51                 uint8
week_52                 uint8
hour_0                  uint8
hour_1                  uint8
hour_2                  uint8
hour_3                  uint8
hour_4                  uint8
hour_5                  uint8
hour_6                  uint8
hour_7                  uint8
hour_8                  uint8
hour_9                  uint8
hour_10                 uint8
hour_11                 uint8
hour_12                 uint8
hour_13                 uint8
hour_14                 uint8
hour_15                 uint8
hour_16                 uint8
hour_17                 uint8
hour_18                 uint8
hour_19                 uint8
hour_20                 uint8
hour_21                 uint8
hour_22                 uint8
hour_23                 uint8
state_AL                uint8
state_AR                uint8
state_AZ                uint8
state_CA                uint8
state_CO                uint8
state_CT                uint8
state_DC                uint8
state_DE                uint8
state_FL                uint8
state_GA                uint8
state_IA                uint8
state_ID                uint8
state_IL                uint8
state_IN                uint8
state_KS                uint8
state_KY                uint8
state_LA                uint8
state_MA                uint8
state_MD                uint8
state_ME                uint8
state_MI                uint8
state_MN                uint8
state_MO                uint8
state_MS                uint8
state_MT                uint8
state_NC                uint8
state_ND                uint8
state_NE                uint8
state_NH                uint8
state_NJ                uint8
state_NM                uint8
state_NV                uint8
state_NY                uint8
state_OH                uint8
state_OK                uint8
state_OR                uint8
state_PA                uint8
state_RI                uint8
state_SC                uint8
state_SD                uint8
state_TN                uint8
state_TX                uint8
state_UT                uint8
state_VA                uint8
state_VT                uint8
state_WA                uint8
state_WI                uint8
state_WV                uint8
state_WY                uint8
dtype: object

IV. Random Forest Model

1 - Train & test dataset

In [ ]:
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

label = all_c_n['Severity_3_4']
features = all_c_n.drop('Severity_3_4',axis = 1)

from sklearn.model_selection import train_test_split
# Your code goes here
x_train, x_test, y_train, y_test = train_test_split(features, \
                                                    label, \
                                                    test_size=0.20)

2 - Modeling & fitting

This project applied the Random forest to train the model. After several attempts, the accuracy is about 60% while only with 100 estimators and max_detph is only 20. If we increase the n_estimators and max_depth, the accuracy of the model will undoubtedly increase. Due to the limited time & laptop storage memory & unstable environment, I didn't try other scenarios (But need to for next step).

In [ ]:
clf = RandomForestRegressor(n_estimators = 50, max_depth=15)
clf.fit(x_train,y_train)
#5:16 40min
Out[ ]:
RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=15, max_features='auto', max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=50, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
In [ ]:
y_pred = clf.predict(x_test)
clf.score(x_test,y_test)
Out[ ]:
0.39475127070354055
In [ ]:
clf2 = RandomForestRegressor(n_estimators = 100, max_depth=20)
clf2.fit(x_train,y_train)

y_pred = clf2.predict(x_test)
clf2.score(x_test,y_test)
Out[ ]:
0.5935807021937516

3 - Feature Importance

The plot shows the importance of each feature. It shows that Feature 0, aka, Grid_ID has the most strong predictability. Although put Grid_ID as a numeric feature into the model is not such accurate, to some extent it takes the geographic location into consideration.

In [ ]:
importances = clf2.feature_importances_
std = np.std([tree.feature_importances_ for tree in clf.estimators_],
             axis=0)
indices = np.argsort(importances)[::-1]

for f in range(x_train.shape[1]):
    print("%d. feature %d (%f)" % (f + 1, indices[f], importances[indices[f]]))
In [ ]:
plt.figure(figsize=(40,12))
plt.title("Feature importances")
plt.bar(range(x_train.shape[1]), importances[indices],
       color="r", yerr=std[indices], align="center")
plt.xticks(range(x_train.shape[1]), indices)
plt.xlim([-1, x_train.shape[1]])
plt.show()

V. Challenges & Next:

Challenges

  1. Analysis/Prediction Unit: The purpose of this project is to figure out which area in the USA is risky for the driver, and built a model to predict the potential future. Thus, this project cares more about the cluster of accidents rather than the single case (which might be caused by accident). Thus, this project applied raster (10km*10km grid) as the basic unit to count the number of accidents.

  2. Big data: Firstly, the dataset has 3millon rows. Secondly, it's obvious that the exploration of accidents' geographic location will be helpful for the prediction, while it's not an easy task to the plot or play with the geodata frame who has 3 million rows. Even though this project use raster to make it easy and count the number of accidents within the grid (side=10km, 30000+),the model considered the time-related variables, e.g. month, week of day, hour, which also expand the data since the model have to record the number of accidents within the grid at a certain time. Apart from that,

  3. Difficulty in finding other sources: To find other related variables is also not an easy task when the project is based on the whole USA, for example county-level census data, neighbourhood-level POI (point of interest) cluster data, road data (e.g. speed limitation, road width, turning curve rate, etc.

Next

  1. In consideration of above challenges, next step, this project should zoom into one city or State, as within the same city or State the data will be more consistent and accessible and the data size will shrink a lot to save the memory for other related features.

  2. For this project, I used raster rather than the street because of the coding limitation, while to increase the accuracy. It seems that current grids are too big to apply the model into the reality. Thus, next step is to switch the analysis unit to street itself rathr than the spatial grid: one possible method is to cut all streets by 1km and predicted the probabilty of accidents at each hour (still classification problem). In that case, a large numebr of interesting vairables can be added into the model, like road type, street width, speed limitation, distance to nearby intersection, even the number of billboars within this 1km street (which will district the driver).